(54) Method and device for voice activity detection and a communication device

(57) The invention concerns a voice activity detection device in which an input speech signal (x(n)) is divided into subsignals (S(s)) representing specific frequency bands, and noise (N(s)) is estimated in the subsignals. On the basis of the estimated noise in the subsignals, subdecision signals (SNR(s)) are generated, and a voice activity decision (V_ind) for the input speech signal is formed on the basis of the subdecision signals. Spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal, and each signal-to-noise ratio represents a subdecision signal (SNR(s)). From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal (V_ind) for the input speech signal is formed on the basis of the comparison.
Description



This invention relates to a voice activity detection device comprising means for detecting voice activity in an input signal and for making a voice activity decision on the basis of the detection. The invention likewise relates to a method for detecting voice activity and to a communication device including voice activity detection means.

A Voice Activity Detector (VAD) determines whether an input signal contains speech or background noise. A typical 
application for a VAD is in wireless communication systems, in which the voice activity detection can be used for con- 
trolling a discontinuous transmission system, where transmission is inhibited when speech is not detected. A VAD can 
also be used in e.g. echo cancellation and noise cancellation. 

Various methods for voice activity detection are known in the prior art. The main problem is to reliably distinguish speech from background noise in noisy environments. Patent publication US 5,459,814 presents a method for voice activity detection in which an average signal level and zero crossings are calculated for the speech signal. This method is computationally simple, but its detection result is not very reliable. Patent publications WO 95/08170 and US 5,276,765 present a voice activity detection method in which a spectral difference between the speech signal and a noise estimate is calculated using LPC (Linear Prediction Coding) parameters. These publications also present an auxiliary VAD detector which controls updating of the noise estimate. The VAD methods of all the above-mentioned publications have difficulty detecting speech reliably when speech power is low compared to noise power.

The present invention concerns a voice activity detection device in which an input speech signal is divided into subsignals representing specific frequency bands and voice activity is detected in the subsignals. On the basis of the detection in the subsignals, subdecision signals are generated, and a voice activity decision for the input speech signal is formed on the basis of the subdecision signals. In the invention, spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal, and each signal-to-noise ratio represents a subdecision signal. From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal for the input speech signal is formed on the basis of the comparison.

To obtain the signal-to-noise ratio for each subsignal, a noise estimate is calculated for each subfrequency band (i.e. for each subsignal). This means that noise can be estimated more accurately and that the noise estimate can also be updated separately for each subfrequency band. A more accurate noise estimate leads to a more accurate and reliable voice activity detection decision. Noise estimate accuracy is also improved by using the speech/noise decision of the voice activity detection device to control the updating of the background noise estimate.

A voice activity detection device and a communication device according to the invention are characterized in that they comprise means for dividing said input signal into subsignals representing specific frequency bands, means for estimating noise in the subsignals, means for calculating subdecision signals on the basis of the noise in the subsignals, and means for making a voice activity decision for the input signal on the basis of the subdecision signals.

A method according to the invention is characterized in that it comprises the steps of dividing said input signal into subsignals representing specific frequency bands, estimating noise in the subsignals, calculating subdecision signals on the basis of the noise in the subsignals, and making a voice activity decision for the input signal on the basis of the subdecision signals.

In the following, the invention is illustrated in more detail with reference to the enclosed figures.

Figure 1 shows briefly the environment in which the voice activity detection device 4 according to the invention is used. The parameter values presented in the following description are exemplary values and describe one embodiment of the invention; they do not in any way limit the method according to the invention to particular parameter values. Referring to figure 1, a signal coming from a microphone 1 is sampled in an A/D converter 2. As exemplary values it is assumed that the sample rate of the A/D converter 2 is 8000 Hz and the frame length of the speech codec 3 is 80 samples, so that each speech frame comprises 10 ms of speech. The VAD device 4 can use the same input frame length as the speech codec 3, or the length can be an integer fraction of the frame length used by the speech codec. The coded speech signal is fed further in a transmission branch, e.g. to a discontinuous transmission handler 5, which controls transmission according to a decision V_ind received from the VAD 4.

One embodiment of the voice activity detection device according to the invention is described in more detail in figure 2. A speech signal coming from the microphone 1 is sampled in an A/D converter 2 into a digital signal x(n). An input frame for the VAD device in Fig. 2 is formed by taking samples from the digital signal x(n). This frame is fed into block 6, in which power spectrum components representing power in predefined bands are calculated. Components proportional to the amplitude or power spectrum of the input frame can be calculated using an FFT, a filter bank, or linear predictor coefficients. This will be explained in more detail later. If the VAD operates with a speech codec that calculates linear prediction coefficients, those coefficients can be received from the speech codec.

Power spectrum components P(f) are first calculated from the input frame using a Fast Fourier Transform (FFT), as presented in figure 3. In the example solution it is assumed that the length of the FFT calculation is 128. Additionally, the power spectrum components P(f) are recombined into calculation spectrum components S(s), reducing the number of spectrum components from 65 to 8.

Referring to Fig. 3, a speech frame is brought to windowing block 10, in which it is multiplied by a predetermined window. The purpose of windowing is in general to enhance the quality of the spectral estimate of a signal and to divide the signal into frames in the time domain. Because the windows used in this example partly overlap, the overlapping samples are stored in a memory (block 15) for the next frame. 80 samples are taken from the signal and combined with 16 samples stored during the previous frame, resulting in a total of 96 samples. Correspondingly, the last 16 of the 80 newly collected samples are stored for use in calculating the next frame.

The 96 samples obtained in this way are multiplied in windowing block 10 by a window comprising 96 sample values, the first 8 values of the window forming the ascending strip $l_U$ of the window and the last 8 values forming the descending strip $l_D$ of the window, as presented in figure 7. The window $l(n)$ can be defined as follows and is realized in block 11 (figure 6):

$$l(n) = \begin{cases} (n+1)/9 = l_U, & n = 0,\ldots,7 \\ 1 = l_M, & n = 8,\ldots,87 \\ (96-n)/9 = l_D, & n = 88,\ldots,95 \end{cases} \qquad (1)$$

Realizing the windowing (block 11) digitally is known to a person skilled in the art of digital signal processing. It should be noted that the middle 80 values of the window (n = 8,...,87, i.e. the middle strip $l_M$) are equal to 1; multiplication by them does not change the result and can therefore be omitted. Thus only the first 8 samples and the last 8 samples in the window need to be multiplied. Because the length of an FFT has to be a power of two, in block 12 (figure 6) 32 zeroes (0) are added at the end of the 96 samples obtained from block 11, resulting in a speech frame comprising 128 samples. Adding samples at the end of a sequence of samples is a simple operation, and the digital realization of block 12 is within the skills of a person skilled in the art.
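The framing, windowing and zero-padding steps described above can be sketched as follows. This is a minimal illustration using the exemplary values of the text (80 new samples, 16 overlap samples, 8-sample ramps, FFT length 128); the function and constant names are hypothetical and do not appear in the source, and for simplicity the sketch multiplies all 96 samples rather than skipping the flat middle strip.

```python
FRAME_LEN = 80                      # new samples per frame (10 ms at 8 kHz)
OVERLAP = 16                        # samples carried over from the previous frame
WIN_LEN = FRAME_LEN + OVERLAP       # 96 samples are windowed
FFT_LEN = 128                       # next power of two above 96
RAMP = 8                            # length of the ascending and descending strips


def make_window():
    """Window of equation (1): 8-sample ramps around a flat middle strip."""
    up = [(n + 1) / 9 for n in range(RAMP)]
    flat = [1.0] * (WIN_LEN - 2 * RAMP)
    down = [(WIN_LEN - n) / 9 for n in range(WIN_LEN - RAMP, WIN_LEN)]
    return up + flat + down


def build_fft_input(new_samples, memory):
    """Return (zero-padded 128-sample frame, overlap memory for the next frame)."""
    assert len(new_samples) == FRAME_LEN and len(memory) == OVERLAP
    frame = list(memory) + list(new_samples)          # 16 old + 80 new samples
    windowed = [x * w for x, w in zip(frame, make_window())]
    padded = windowed + [0.0] * (FFT_LEN - WIN_LEN)   # append 32 zeroes
    return padded, list(new_samples[-OVERLAP:])
```

The second return value plays the role of the memory in block 15: it is passed back in as `memory` when the next frame is processed.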

After windowing has been carried out in windowing block 10, the spectrum of a speech frame is calculated in block 20 employing the Fast Fourier Transform (FFT). Samples x(0), x(1), ..., x(n); n = 127 (i.e. said 128 samples) in the frame arriving at FFT block 20 are transformed to the frequency domain employing a real FFT, giving frequency domain samples X(0), X(1), ..., X(f); f = 64 (more generally f = (n+1)/2), in which each sample comprises a real component $X_r(f)$ and an imaginary component $X_i(f)$:

$$X(f) = X_r(f) + jX_i(f), \quad f = 0,\ldots,64 \qquad (2)$$

Realizing the Fast Fourier Transform digitally is known to a person skilled in the art. The real and imaginary components obtained from the FFT are squared and added together in pairs in squaring block 50, the output of which is the power spectrum of the speech frame. If the FFT length is 128, the number of power spectrum components obtained is 65, which is obtained by dividing the length of the FFT by two and incrementing the result by 1, in other words (length of FFT)/2 + 1. Accordingly, the power spectrum is obtained from squaring block 50 by calculating the sum of the squares of the real and imaginary components, component by component:

$$P(f) = X_r^2(f) + X_i^2(f), \quad f = 0,\ldots,64 \qquad (3)$$

The function of squaring block 50 can be realized, as presented in figure 8, by taking the real and imaginary components to squaring blocks 51 and 52 (which carry out a simple mathematical squaring, known to be realizable digitally) and by summing the squared components in a summing unit 53. In this way, power spectrum components P(0), P(1), ..., P(f); f = 64 are obtained as the output of squaring block 50, and they correspond to the powers of the components in the time domain signal at different frequencies as follows (presuming that an 8 kHz sampling frequency is used):

P(f) for values f = 0,...,64 corresponds to middle frequencies (f · 4000/64 Hz) (4)
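Equations (2) to (4) can be illustrated with a short sketch. It is only an illustration: a direct DFT stands in for the FFT block so that the example needs nothing beyond the standard library, and the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """P(f) = Xr(f)^2 + Xi(f)^2 for f = 0..len(frame)/2, using a direct DFT."""
    n = len(frame)
    spectrum = []
    for f in range(n // 2 + 1):
        # X(f) = sum over t of x(t) * exp(-j*2*pi*f*t/n), as in equation (2)
        x = sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        spectrum.append(x.real ** 2 + x.imag ** 2)   # equation (3)
    return spectrum


def bin_frequency_hz(f, sample_rate=8000, fft_len=128):
    """Middle frequency of component P(f), equation (4): f * 4000/64 Hz."""
    return f * (sample_rate / 2) / (fft_len // 2)
```

A 128-sample frame yields 65 components, and with 8 kHz sampling each component is 62.5 Hz apart.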

After this, 8 new power spectrum components, or power spectrum component combinations S(s), s = 0,...,7, are formed in block 60; they are here called calculation spectrum components. Each calculation spectrum component S(s) is formed by summing 7 adjacent power spectrum components P(f) as follows:

S(0) = P(1) + P(2) + ... + P(7)      (5)
S(1) = P(8) + P(9) + ... + P(14)
S(2) = P(15) + P(16) + ... + P(21)
S(3) = P(22) + ... + P(28)
S(4) = P(29) + ... + P(35)
S(5) = P(36) + ... + P(42)
S(6) = P(43) + ... + P(49)
S(7) = P(50) + ... + P(56)

This can be realized, as presented in figure 9, utilizing counter 61 and summing unit 62, so that the counter 61 always counts up to seven and, controlled by the counter, summing unit 62 always sums seven consecutive components and produces the sum as an output. In this case the lowest combination component S(0) corresponds to middle frequencies [62.5 Hz to 437.5 Hz] and the highest combination component S(7) corresponds to middle frequencies [3125 Hz to 3500 Hz]. The frequencies lower than this (below 62.5 Hz) or higher than this (above 3500 Hz) are not essential for speech and can be ignored.
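The grouping of equation (5) amounts to summing consecutive runs of seven power spectrum components, starting at P(1). A minimal sketch, with hypothetical function names, that also recovers the band limits quoted above:

```python
def calculation_spectrum(P, bands=8, width=7, first_bin=1):
    """S(s) = P(first_bin + 7s) + ... + P(first_bin + 7s + 6), equation (5)."""
    return [
        sum(P[first_bin + s * width: first_bin + (s + 1) * width])
        for s in range(bands)
    ]


def band_edges_hz(s, width=7, first_bin=1, hz_per_bin=62.5):
    """Middle frequencies of the lowest and highest component in band s."""
    lo = first_bin + s * width
    hi = lo + width - 1
    return lo * hz_per_bin, hi * hz_per_bin
```

P(0) and P(57) to P(64) are never read, which matches the statement that frequencies below 62.5 Hz and above 3500 Hz are ignored.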
Instead of using the solution of Figure 3, power spectrum components P(f) can also be calculated from the input frame using a filter bank as presented in figure 4. The filter bank comprises bandpass filters $H_j(z)$, j = 0,...,7, covering the frequency band of interest. The filter bank can be either uniform or composed of variable bandwidth filters. Typically, the filter bank outputs are decimated to improve efficiency. The design and digital implementation of filter banks is known to a person skilled in the art. Sub-band samples $z_j(i)$ in each band j are calculated from the input signal x(n) using filter $H_j(z)$. Signal power at each band can be calculated as follows:

$$S(j) = \sum_{i=0}^{L-1} z_j(i) \cdot z_j(i) \qquad (6)$$

where L is the number of samples in the sub-band within one input frame.

When a VAD is used with a speech codec, the calculation spectrum components S(s) can be calculated using Linear Prediction Coefficients (LPC), which are calculated by most of the speech codecs used in digital mobile phone systems. Such an arrangement is presented in figure 5. LPC coefficients are calculated in a speech codec 3 using a technique called linear prediction, in which a linear filter is formed. The LPC coefficients of the filter are direct order coefficients d(i), which can be calculated from autocorrelation coefficients ACF(k). As will be shown below, the direct order coefficients d(i) can be used for calculating the calculation spectrum components S(s). The autocorrelation coefficients ACF(k), which can be calculated from the input frame samples x(n), can be used for calculating the LPC coefficients. If LPC coefficients or ACF(k) coefficients are not available from the speech codec, they can be calculated from the input frame. Autocorrelation coefficients ACF(k) are calculated in the speech codec 3 as follows:






EP 0 784 311 A1 



$$ACF(k) = \sum_{i=k}^{N} x(i)\,x(i-k), \quad k = 0,1,\ldots,M \qquad (7)$$

where

N is the number of samples in the input frame,
M is the LPC order (e.g. 8), and
x(i) are the samples in the input frame.

LPC coefficients d(i), which represent the impulse response of the short term analysis filter, can be calculated from the autocorrelation coefficients ACF(k) using a previously known method, e.g. the Schur recursion algorithm or the Levinson-Durbin algorithm.
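As an illustration of this step, the sketch below computes autocorrelation coefficients and then runs the textbook Levinson-Durbin recursion. It is not taken from the source: the function names are hypothetical, the samples are indexed from 0 (so the sum runs over i = k..N-1), and the returned coefficients a[0..M] with a[0] = 1 are assumed to play the role of the direct order coefficients d(i).

```python
def autocorrelation(x, order):
    """ACF(k) = sum of x(i) * x(i - k) for k = 0..order, samples indexed from 0."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]


def levinson_durbin(acf):
    """Prediction-error filter coefficients a[0..M], a[0] = 1, from ACF(0..M)."""
    order = len(acf) - 1
    a = [1.0] + [0.0] * order
    err = acf[0]                      # prediction error power of order 0
    for m in range(1, order + 1):
        if err == 0:
            break                     # silent frame: nothing to predict
        acc = acf[m] + sum(a[i] * acf[m - i] for i in range(1, m))
        k = -acc / err                # reflection coefficient of stage m
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a
```

For an autocorrelation sequence of the form ACF(k) = 0.5^k, the recursion returns a first-order predictor: a[1] is -0.5 and the higher coefficients vanish.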

The amplitude at a desired frequency is calculated in block 8, shown in figure 5, from the LPC values using a Fast Fourier Transform (FFT) according to the following equation:

$$A(k) = \left| \sum_{i=0}^{M-1} d(i)\, e^{-j 2\pi i k / K} \right| \qquad (8)$$

where

K is a constant, e.g. 8000,
k corresponds to the frequency for which power is calculated (i.e., A(k) corresponds to the frequency k/K·fs, where fs is the sample frequency), and
M is the order of the short term analysis.



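The amplitude of a desired frequency band can be estimated as follows:

$$A(k_1,k_2) \approx [\ldots] \left| \sum_{i=0}^{M-1} d(i)\, C(k_1,k_2,i) \right| \qquad (9)$$

where k1 is the start index of the frequency band, k2 is the end index of the frequency band, and [...] stands for a leading normalizing fraction that is not legible in the source text.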

The coefficients C(k1,k2,i) can be calculated beforehand and saved in a memory (not shown) to reduce the required computation load. These coefficients can be calculated as follows:

$$C(k_1,k_2,i) = \sum_{k=k_1}^{k_2} e^{-j 2\pi i k / K} \qquad (10)$$



An approximation of the signal power at calculation spectrum component S(s) can be calculated by inverting the square of the amplitude A(k1,k2) and multiplying by ACF(0). The inversion is needed because the linear predictor coefficients represent the inverse spectrum of the input signal. ACF(0) represents the signal power and is calculated in equation (7).

$$S(s) = \frac{ACF(0)}{A(k_1,k_2)^2} \qquad (11)$$

where each calculation spectrum component S(s) is calculated using specific constants k1 and k2 which define the band limits.

Different ways of calculating the power (calculation) spectrum components S(s) have been described above.
Further in Fig. 2, the spectrum of noise N(s), s = 0,...,7, is estimated in estimation block 80 (presented in more detail in figure 11) when the voice activity detector does not detect speech. Estimation is carried out in block 80 by recursively calculating a time-averaged mean value for each spectrum component S(s), s = 0,...,7, of the signal brought from block 6:

$$N_n(s) = \lambda(s)\, N_{n-1}(s) + (1-\lambda(s))\, S(s), \quad s = 0,\ldots,7 \qquad (12)$$

In this context $N_{n-1}(s)$ means the calculated noise spectrum estimate for the previous frame, obtained from memory 83 as presented in figure 11, and $N_n(s)$ means the estimate for the present frame (n = frame order number) according to the equation above. This calculation is preferably carried out digitally in block 81, the inputs of which are the spectrum components S(s) from block 6, the estimate for the previous frame $N_{n-1}(s)$ obtained from memory 83, and the value of the time-constant variable $\lambda(s)$ calculated in block 82. The updating can be done using a faster time constant when the input spectrum components S(s) are lower than the noise estimate components $N_{n-1}(s)$. The value of the variable $\lambda(s)$ is determined according to the following table (typical values for $\lambda(s)$):



S(s) < N_{n-1}(s)    (V_ind, ST_count)    λ(s)
Yes                  (0,0)                0.85
No                   (0,0)                0.9
Yes                  (0,1)                0.85
No                   (0,1)                0.9
Yes                  (1,0)                0.9
No                   (1,0)                1 (no updating)
Yes                  (1,1)                0.9
No                   (1,1)                0.95



The values V_ind and ST_count are explained in more detail later on.

In the following, the symbol N(s) is used for the noise spectrum estimate calculated for the present frame. The calculation according to the above estimation is preferably carried out digitally. Carrying out multiplications, additions and subtractions according to the above equation digitally is well known to a person skilled in the art.
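The recursive noise update of equation (12), together with the table of typical time-constant values, can be sketched as a lookup followed by one smoothing step per band. The names below are hypothetical; the numeric values are the typical values listed in the table.

```python
# Key: (S(s) < N_prev(s), V_ind, ST_count) -> time constant lambda(s)
LAMBDA_TABLE = {
    (True, 0, 0): 0.85, (False, 0, 0): 0.9,
    (True, 0, 1): 0.85, (False, 0, 1): 0.9,
    (True, 1, 0): 0.9,  (False, 1, 0): 1.0,   # 1.0 means no updating
    (True, 1, 1): 0.9,  (False, 1, 1): 0.95,
}


def update_noise(noise_prev, spectrum, v_ind, st_count):
    """Return N_n(s) = lambda(s) * N_{n-1}(s) + (1 - lambda(s)) * S(s) per band."""
    updated = []
    for n_prev, s in zip(noise_prev, spectrum):
        lam = LAMBDA_TABLE[(s < n_prev, v_ind, st_count)]
        updated.append(lam * n_prev + (1.0 - lam) * s)
    return updated
```

Note how the estimate is frozen only in the case where speech is indicated, the stationarity flag is clear, and the band power is not below the current noise estimate.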
Further in Fig. 2, a ratio SNR(s), s = 0,...,7, is calculated from the input spectrum S(s) and the noise spectrum N(s), component by component, in calculation block 90; this ratio is called the signal-to-noise ratio:

$$SNR(s) = \frac{S(s)}{N(s)} \qquad (13)$$



The signal-to-noise ratios SNR(s) represent a kind of voice activity decision for each frequency band of the calculation spectrum components. From the signal-to-noise ratios SNR(s) it can be determined whether the frequency band signal contains speech or noise, and each ratio accordingly indicates voice activity. The calculation block 90 is also preferably realized digitally, and it carries out the above division. Carrying out a division digitally is as such known to a person skilled in the art.

In Fig. 2 the relative noise level is calculated in block 70, which is presented in more detail in figure 10, and in which the time averaged mean value for speech $\hat{S}(n)$ is calculated using the power spectrum estimate S(s), s = 0,...,7. The time averaged mean value $\hat{S}(n)$ is updated when speech is detected. First the mean value $\bar{S}(n)$ of the power spectrum components in the present frame is calculated in block 71, into which the spectrum components S(s) are obtained as an input from block 60, as follows:

$$\bar{S}(n) = \frac{1}{8} \sum_{s=0}^{7} S(s) \qquad (14)$$



The time averaged mean value $\hat{S}(n)$ is calculated in block 72 (e.g. recursively) from the time averaged mean value $\hat{S}(n-1)$ for the previous frame, which is obtained from memory 78 where it was stored during the previous frame, the calculation spectrum mean value $\bar{S}(n)$ obtained from block 71, and a time constant $\alpha$ which has been stored in advance in memory 79a:

$$\hat{S}(n) = \alpha\,\hat{S}(n-1) + (1-\alpha)\,\bar{S}(n), \qquad (15)$$

in which n is the order number of a frame and $\alpha$ is said time constant, the value of which is from 0.0 to 1.0, typically between 0.9 and 1.0. So that very weak speech (e.g. at the end of a sentence) is not included in the time averaged mean value, it is updated only if the mean value of the spectrum components for the present frame exceeds a threshold value dependent on the time averaged mean value. This threshold value is typically one quarter of the time averaged mean value. The calculation of the two previous equations is preferably executed digitally.

Correspondingly, the time averaged mean value of noise power $\hat{N}(n)$ is obtained from calculation block 73 by using the power spectrum estimate of noise N(s), s = 0,...,7, and the component mean value $\bar{N}(n)$ calculated from it, according to the next equation:

$$\hat{N}(n) = \beta\,\hat{N}(n-1) + (1-\beta)\,\bar{N}(n), \qquad (16)$$

in which $\beta$ is a time constant, the value of which is from 0.0 to 1.0, typically between 0.9 and 1.0. The noise power time averaged mean value is updated in each frame. The mean value of the noise spectrum components $\bar{N}(n)$ is calculated in block 76, based upon the spectrum components N(s), as follows:

$$\bar{N}(n) = \frac{1}{8} \sum_{s=0}^{7} N(s) \qquad (17)$$

and the noise power time averaged mean value $\hat{N}(n-1)$ for the previous frame is obtained from memory 74, in which it was stored during the previous frame. The relative noise level $\eta$ is calculated in block 75 as a scaled and maximum-limited quotient of the time averaged mean values of noise and speech:



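$$\eta = \min\left(max\_\eta,\; \kappa\,\frac{\hat{N}(n)}{\hat{S}(n)}\right), \qquad (18)$$

in which $\kappa$ is a scaling constant (typical value 4.0), which has been stored in advance in memory 77, $\hat{N}(n)$ and $\hat{S}(n)$ are the time averaged mean values of noise and speech, and $max\_\eta$ is the maximum value of the relative noise level (typically 1.0), which has been stored in memory 79b.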
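The relative noise level computation can be sketched as three small functions: one for the conditional speech-level update, one for the noise-level update, and one for the scaled and limited quotient. The function names, the default time constants of 0.95 (chosen inside the stated typical range of 0.9 to 1.0) and the guard for a zero speech level are assumptions made for this illustration.

```python
def mean(values):
    return sum(values) / len(values)


def update_speech_level(s_avg_prev, spectrum, speech_detected, alpha=0.95):
    """Time averaged speech level; skipped for weak frames (below 1/4 of the average)."""
    s_mean = mean(spectrum)
    if speech_detected and s_mean > 0.25 * s_avg_prev:
        return alpha * s_avg_prev + (1.0 - alpha) * s_mean
    return s_avg_prev


def update_noise_level(n_avg_prev, noise, beta=0.95):
    """Time averaged noise level, updated in every frame."""
    return beta * n_avg_prev + (1.0 - beta) * mean(noise)


def relative_noise_level(n_avg, s_avg, kappa=4.0, max_eta=1.0):
    """Scaled quotient of noise and speech levels, limited to max_eta."""
    if s_avg <= 0.0:
        return max_eta            # assumption: no speech level known yet
    return min(max_eta, kappa * n_avg / s_avg)
```

With a speech level of 100 and a noise level of 1, the relative noise level is 0.04; once the noise level reaches a quarter of the speech level, the value saturates at 1.0.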

For producing a VAD decision in the device in Fig. 2, a distance D_SNR between the input signal and the noise model is calculated in the VAD decision block 110 utilizing the signal-to-noise ratios SNR(s), by digitally realizing the following equation:

$$D_{SNR} = \sum_{s=s\_l}^{s\_h} v_s\, SNR(s), \qquad (19)$$

in which s_l and s_h are the index values of the lowest and highest frequency components included and $v_s$ is the component weighting coefficient; these are predetermined and stored in advance in a memory, from which they are retrieved for calculation. Typically, all signal-to-noise estimate value components are used (s_l = 0 and s_h = 7), and they are weighted equally: $v_s$ = 1.0/8.0; s = 0,...,7.

The following is a closer description of the embodiment of a VAD decision block 110, with reference to figure 12. A summing unit 111 in the voice activity detector sums the values of the signal-to-noise ratios SNR(s) obtained from the different frequency bands, whereby the parameter D_SNR, describing the spectrum distance between the input signal and the noise model, is obtained according to the above equation (19). The value D_SNR from the summing unit 111 is compared with a predetermined threshold value vth in comparator unit 112. If the threshold value vth is exceeded, the frame is regarded as containing speech. The summing can also be weighted in such a way that more weight is given to the frequencies at which the signal-to-noise ratio can be expected to be good. The output and decision of the voice activity detector can be presented with a variable V_ind, for the values of which the following conditions are obtained:
$$V_{ind} = \begin{cases} 0, & D_{SNR} \le vth \\ 1, & D_{SNR} > vth \end{cases} \qquad (20)$$
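The decision logic of equations (13), (19) and (20) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that every noise component is strictly positive so that the division is defined.

```python
def snr_per_band(spectrum, noise):
    """SNR(s) = S(s) / N(s) for each band, equation (13)."""
    return [s / n for s, n in zip(spectrum, noise)]


def vad_decision(spectrum, noise, vth, weights=None):
    """Return (V_ind, D_SNR): speech is indicated when D_SNR exceeds vth."""
    snr = snr_per_band(spectrum, noise)
    if weights is None:
        weights = [1.0 / len(snr)] * len(snr)       # equal weighting, 1/8 each
    d_snr = sum(w * r for w, r in zip(weights, snr))
    return (1 if d_snr > vth else 0), d_snr
```

For example, a frame whose band powers are four times the noise estimate in every band gives D_SNR = 4.0 and is classified as speech for a threshold of 2.0, while a frame equal to the noise estimate gives D_SNR = 1.0 and is classified as noise.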



Because the VAD controls the updating of the background spectrum estimate N(s), and the latter in turn affects the function of the voice activity detector in the way described above, it is possible that both noise and speech are indicated as speech (V_ind = 1) if the background noise level suddenly increases. This further inhibits updating of the background spectrum estimate N(s). To prevent this, the time (number of frames) during which subsequent frames are regarded not to contain speech is monitored. Subsequent frames which are stationary and are not indicated as voiced are assumed not to contain speech.

In block 7 in figure 2, Long Term Prediction (LTP) analysis, which is also called pitch analysis, is carried out. Voiced detection is done using the long term predictor parameters, which are the lag (i.e. pitch period) and the long term predictor gain. These parameters are calculated in most speech coders. Thus if a voice activity detector is used alongside a speech codec (as described in Fig. 5), these parameters can be obtained from the speech codec.

The long term prediction analysis can be calculated from a number of samples M which equals the frame length N, or the input frame can be divided into sub-frames (e.g. 4 sub-frames, 4*M = N) and the long term parameters calculated separately for each sub-frame. The division of the input frame into these sub-frames is done in the LTP analysis block 7 (Fig. 2). The sub-frame samples are denoted xs(i).

Accordingly, in block 7 the auto-correlation R(l) is first calculated from the sub-frame samples xs(i):

$$R(l) = \sum_{i=0}^{M} xs(i) \cdot xs(i-l) \qquad (21)$$

where l = Lmin,...,Lmax (e.g. Lmin = 40, Lmax = 160).

The last Lmax samples from the old sub-frames must be saved for the above calculation. Then the maximum value Rmax of R(l) is searched for, so that Rmax = max(R(l)), where l = 40,...,160. The long term predictor lag LTP_lag(j) is the index l which corresponds to Rmax. The variable j indicates the index of the sub-frame (j = 0..3).

LTP_gain can be calculated as follows:

LTP_gain(j) = Rmax/Rtot

where

$$Rtot = \sum_{i=0}^{N} xs(i - LTP\_lag(j))^2 \qquad (22)$$



A parameter representing the long term predictor gain of a frame (LTP_gain_sum) can be calculated by summing the long term predictor gains of the sub-frames (LTP_gain(j)):

$$LTP\_gain\_sum = \sum_{j=0}^{3} LTP\_gain(j) \qquad (23)$$



If LTP_gain_sum is higher than a fixed threshold thr_lag, the frame is indicated to be voiced:

    if (LTP_gain_sum > thr_lag)
        voiced = 1
    else
        voiced = 0
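The lag search and gain computation for one sub-frame can be sketched as below. This is an illustration under stated assumptions: the sums run over exactly the samples of the sub-frame, the sub-frame is passed as the tail of a longer history buffer so that samples up to Lmax in the past are available, and all names are hypothetical. No value for thr_lag is given in the text, so it is left as a parameter.

```python
def ltp_gain(history, sub_len, l_min=40, l_max=160):
    """Return (lag, gain) for the sub-frame formed by the last sub_len samples."""
    start = len(history) - sub_len
    assert start >= l_max, "need at least l_max past samples before the sub-frame"
    best_lag, r_max = l_min, float("-inf")
    for lag in range(l_min, l_max + 1):
        # R(l): correlation of the sub-frame with the signal delayed by l samples
        r = sum(history[start + i] * history[start + i - lag] for i in range(sub_len))
        if r > r_max:
            best_lag, r_max = lag, r
    # Rtot: energy of the delayed segment at the chosen lag
    r_tot = sum(history[start + i - best_lag] ** 2 for i in range(sub_len))
    gain = r_max / r_tot if r_tot > 0.0 else 0.0
    return best_lag, gain


def is_voiced(sub_frame_gains, thr_lag):
    """voiced = 1 when the summed sub-frame gains exceed thr_lag."""
    return 1 if sum(sub_frame_gains) > thr_lag else 0
```

For a signal that repeats exactly every 50 samples, the search returns a lag of 50 and a gain of 1.0, which is the behaviour expected for strongly periodic (voiced) speech.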

Further in Fig. 2, an average noise spectrum estimate NA(s) is calculated in block 100 as follows:

$$NA_n(s) = \alpha\, NA_{n-1}(s) + (1-\alpha)\, S(s), \quad s = 0,\ldots,7 \qquad (24)$$

where $\alpha$ is a time constant of value 0 < $\alpha$ < 1 (e.g. 0.9).

Also a spectrum distance D between the average noise spectrum estimate NA(s) and the spectrum estimate S(s) is calculated in block 100 as follows:

$$D = \sum_{s=0}^{7} \frac{\max(NA(s),\, S(s))}{\min(NA(s),\, S(s),\, Low\_Limit)} \qquad (25)$$

Low_Limit is a small constant, which is used to keep the division result small when the noise spectrum or the signal spectrum at some frequency band is low.

If the spectrum distance D is larger than a predetermined threshold Dlim, a stationarity counter stat_cnt is set to zero. If the spectrum distance D is smaller than the threshold Dlim and the signal is not detected as voiced (voiced = 0), the stationarity counter is incremented. The following conditions are obtained for the stationarity counter:

    if (D > Dlim)
        stat_cnt = 0
    if (D < Dlim and voiced = 0)
        stat_cnt = stat_cnt + 1

Block 100 gives an output stat_cnt, which is reset to zero when V_ind gets the value 0, to meet the following condition:

    if (V_ind = 0)
        stat_cnt = 0

If this number of subsequent frames exceeds a predetermined threshold value max_spf, the value of which is e.g. 50, the value of ST_count is set to 1. This provides the following conditions for the output ST_count in relation to the counter value stat_cnt:

    if (stat_cnt > max_spf)
        ST_count = 1
    else
        ST_count = 0
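The stationarity logic can be sketched as a small state update. Two points are assumptions of this illustration rather than statements of the source: the denominator of the spectrum distance is read as the smaller of the two spectra floored at Low_Limit (so that the quotient stays bounded), and the numeric values used for Low_Limit and Dlim are placeholders, since the text gives none.

```python
def spectrum_distance(noise_avg, spectrum, low_limit=1e-3):
    """Sum over bands of larger/smaller spectrum value, smaller value floored."""
    return sum(
        max(na, s) / max(min(na, s), low_limit)   # assumed reading of equation (25)
        for na, s in zip(noise_avg, spectrum)
    )


def update_stationarity(stat_cnt, d, voiced, v_ind, d_lim, max_spf=50):
    """Return (stat_cnt, ST_count) after processing one frame."""
    if d > d_lim:
        stat_cnt = 0                  # spectrum changed: not stationary
    elif d < d_lim and voiced == 0:
        stat_cnt += 1                 # stationary and unvoiced frame
    if v_ind == 0:
        stat_cnt = 0                  # VAD already indicates noise
    st_count = 1 if stat_cnt > max_spf else 0
    return stat_cnt, st_count
```

When the input spectrum equals the average noise spectrum in all 8 bands the distance is 8.0, the smallest value it can take; ST_count becomes 1 only after more than max_spf consecutive stationary, unvoiced frames that the VAD has labelled as speech.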

Additionally, in the invention the accuracy of the background spectrum estimate N(s) is enhanced by adjusting said threshold value vth of the voice activity detector utilizing the relative noise level $\eta$ (which is calculated in block 70). In an environment in which the signal-to-noise ratio is very good (i.e. the relative noise level $\eta$ is low), the value of the threshold vth is increased based upon the relative noise level $\eta$. This reduces the interpretation of rapid changes in background noise as speech. Adaptation of the threshold value vth is carried out in block 113 according to the following:

$$vth1 = \max(vth\_min1,\; vth\_fix1 - vth\_slope1 \cdot \eta), \qquad (26)$$

in which vth_fix1, vth_min1 and vth_slope1 are positive constants, typical values for which are e.g. vth_fix1 = 2.5, vth_min1 = 2.0 and vth_slope1 = 8.0.

In an environment with a high noise level, the threshold is decreased to reduce the probability that speech is detected as noise. The mean value of the noise spectrum components $\bar{N}(n)$ is then used to decrease the threshold vth as follows:

$$vth2 = \min(vth1,\; vth\_fix2 - vth\_slope2 \cdot \bar{N}(n)), \qquad (27)$$

in which vth_fix2 and vth_slope2 are positive constants. Thus if the mean value of the noise spectrum components $\bar{N}(n)$ is large enough, the threshold vth2 is lower than the threshold vth1.
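The two adaptation steps of equations (26) and (27) can be sketched as one function. The function name is hypothetical, and all constants are passed in explicitly because the text gives no typical values for vth_fix2 and vth_slope2; the numbers in the usage note are purely illustrative.

```python
def adapt_threshold(eta, noise_mean,
                    vth_fix1, vth_min1, vth_slope1,
                    vth_fix2, vth_slope2):
    """Return (vth1, vth2) from the relative noise level and the mean noise power."""
    # Equation (26): a low relative noise level eta gives a higher threshold.
    vth1 = max(vth_min1, vth_fix1 - vth_slope1 * eta)
    # Equation (27): a high absolute noise level can only lower the threshold.
    vth2 = min(vth1, vth_fix2 - vth_slope2 * noise_mean)
    return vth1, vth2
```

With illustrative constants (3.0, 2.0, 8.0, 4.0, 0.5), a clean signal (eta = 0, noise mean 0) keeps both thresholds at 3.0, whereas eta = 1 and a noise mean of 5 give vth1 = 2.0 and vth2 = 1.5.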

The voice activity detector according to the invention can also be enhanced in such a way that the threshold vth2 is further decreased during speech bursts. This enhances the operation, because as speech slowly becomes quieter, the end of speech could otherwise be taken for noise. The additional threshold adaptation can be implemented in the following way (in block 113):

First, D_SNR is limited between the desired maximum (typically 5) and minimum (typically 2) values according to the following conditions:

    D = D_SNR
    if D < D_min
        D = D_min
    if D > D_max
        D = D_max



After this a threshold adaptation coefficient $ta_0$ is calculated by

$$ta_0 = th_{max} - \frac{D - D_{min}}{D_{max} - D_{min}}\,(th_{max} - th_{min}), \qquad (28)$$

where $th_{min}$ and $th_{max}$ are the minimum (typically 0.5) and maximum (typically 1) scaler values, respectively.

The actual scaler for frame n, ta(n), is calculated by smoothing $ta_0$ with a filter with different time constants for increasing and decreasing values. The smoothing may be performed according to the following equations:

    if ta_0 > ta(n-1)
        ta(n) = λ_0 ta(n-1) + (1 - λ_0) ta_0      (29)
    else
        ta(n) = λ_1 ta(n-1) + (1 - λ_1) ta_0

Here λ_0 and λ_1 are the attack (increase period; typical value 0.9) and release (decrease period; typical value 0.5) time constants. Finally, the scaler ta(n) can be used to scale the threshold vth2 in order to obtain a new VAD threshold value vth, whereby

    vth = ta(n) · vth2      (30)
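The limiting, mapping and smoothing steps can be combined into one small function. The function name is hypothetical; the default values are the typical values quoted in the text.

```python
def threshold_scaler(d_snr, ta_prev,
                     d_min=2.0, d_max=5.0,
                     th_min=0.5, th_max=1.0,
                     attack=0.9, release=0.5):
    """Return the smoothed scaler ta(n) for the current frame."""
    d = min(max(d_snr, d_min), d_max)                 # limit D to [d_min, d_max]
    ta0 = th_max - (d - d_min) / (d_max - d_min) * (th_max - th_min)
    lam = attack if ta0 > ta_prev else release        # slower rise, faster fall
    return lam * ta_prev + (1.0 - lam) * ta0
```

A strong speech frame (D_SNR at or above 5) pulls the scaler quickly toward 0.5, lowering the threshold during the burst: starting from 1.0, one such frame gives 0.75. The final threshold is then obtained as `vth = threshold_scaler(...) * vth2`.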

A frequently occurring problem in a voice activity detector is that speech is not detected immediately at the beginning of speech, and that the end of speech is not detected correctly either. This in turn causes the background noise estimate N(s) to get an incorrect value, which again affects later results of the voice activity detector. This problem can be eliminated by updating the background noise estimate using a delay. In this case a certain number N (e.g. N = 2) of power spectra (here calculation spectra) S_1(s),...,S_N(s) of the last frames are stored (e.g. in a buffer implemented at the input of block 80, not shown in figure 11) before updating the background noise estimate N(s). If during twice that number of the most recent frames (i.e. during 2*N frames) the voice activity detector has not detected speech, the background noise estimate N(s) is updated with the oldest power spectrum S_1(s) in memory; in any other case updating is not done. This ensures that N frames before and after the frame used for updating have been noise.
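The delayed update can be sketched with two fixed-length buffers, one holding the last N calculation spectra and one holding the last 2*N VAD decisions. The class and method names are hypothetical, and the exact alignment between the two buffers is an assumption of this sketch.

```python
from collections import deque


class DelayedNoiseUpdater:
    """Buffers spectra and releases the oldest one only after 2*N noise-only frames."""

    def __init__(self, n_delay=2):
        self.spectra = deque(maxlen=n_delay)          # last N spectra
        self.decisions = deque(maxlen=2 * n_delay)    # last 2*N values of V_ind

    def push(self, spectrum, v_ind):
        """Store one frame; return the spectrum to update N(s) with, or None."""
        self.spectra.append(list(spectrum))
        self.decisions.append(v_ind)
        window_full = len(self.decisions) == self.decisions.maxlen
        if window_full and not any(self.decisions):
            return self.spectra[0]                    # oldest stored spectrum
        return None
```

The returned spectrum, when not `None`, would be fed to the recursive noise estimate in place of the current frame's spectrum; a single frame marked as speech blocks updating until it has left the decision window.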

The method according to the invention and the device for voice activity detection are particularly suitable for use in communication devices such as a mobile station or a mobile communication system (e.g. in a base station), and they are not limited to any particular architecture (TDMA, CDMA, digital/analog). Figure 13 presents a mobile station according to the invention, in which voice activity detection according to the invention is employed. The speech signal to be transmitted, coming from a microphone 1, is sampled in an A/D converter 2 and speech coded in a speech codec 3, after which base frequency signal processing (e.g. channel encoding, interleaving), mixing and modulation into radio frequency, and transmission are performed in block TX. The voice activity detector 4 (VAD) can be used for controlling discontinuous transmission by controlling block TX according to the output V_ind of the VAD. If the mobile station includes an echo and/or noise canceller ENC, the VAD 4 according to the invention can also be used in controlling block ENC. From block TX the signal is transmitted through a duplex filter DPLX and an antenna ANT. The known operations of a reception branch RX are carried out for speech received at reception, and it is reproduced through loudspeaker 9. The VAD 4 could also be used for controlling any reception branch RX operations, e.g. in relation to echo cancellation.

The realization and embodiments of the invention have been presented here by way of examples of the method and the device. It is evident to a person skilled in the art that the invention is not limited to the details of the presented embodiments and that the invention can also be realized in another form without deviating from the characteristics of the invention. The presented embodiments should be regarded only as illustrative, not limiting. Thus the possibilities to realize and use the invention are limited only by the enclosed claims. Accordingly, the different alternatives for implementing the invention defined by the claims, including equivalent realizations, are included in the scope of the invention.

Claims 

1. A voice activity detection device comprising

means for detecting voice activity in an input signal (x(n)), and

means for making a voice activity decision (V_ind) on basis of the detection, characterized in that it comprises

means (6) for dividing said input signal (x(n)) in subsignals (S(s)) representing specific frequency bands,
means (80) for estimating noise (N(s)) in the subsignals,
means (90) for calculating subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
means (110) for making a voice activity decision (V_ind) for the input signal on basis of the subdecision signals.

2. A voice activity detection device according to claim 1, characterized in that it comprises means (90) for calculating a signal-to-noise ratio (SNR) for each subsignal and for providing said signal-to-noise ratios as subdecision signals (SNR(s)).

3. A voice activity detection device according to claim 2, characterized in that the means (110) for making a voice activity decision (V_ind) for the input signal comprises

means (111) for creating a value (D_SNR) based on said signal-to-noise ratios (SNR(s)), and

means (112) for comparing said value (D_SNR) with a threshold value (vth) and for outputting a voice activity decision signal (V_ind) on basis of said comparison.

4. A voice activity detection device according to claim 1, characterized in that it comprises means (70) for determining the mean level of a noise component and a speech component (N, S) contained in the input signal, and means (113) for adjusting said threshold value (vth) based upon the mean level of the noise component and the speech component (N, S).

5. A voice activity detection device according to claim 2, characterized in that it comprises means (113) for adjusting said threshold value (vth) based upon past signal-to-noise ratios (SNR(s)).

6. A voice activity detection device according to claim 2, characterized in that it comprises means (80) for storing the value of the estimated noise (N(s)) and said noise (N(s)) is updated with past subsignals (S(s)) depending on past and present signal-to-noise ratios (SNR(s)).

7. A voice activity detection device according to claim 1, characterized in that it comprises means (3) for calculating linear prediction coefficients based on the input signal (x(n)), and means (8) for calculating said subsignals (S(s)) based on said linear prediction coefficients.

8. A voice activity detection device according to claim 1, characterized in that it comprises

means (7) for calculating a long term prediction analysis producing long term predictor parameters, said parameters including long term predictor gain (LTP_gain_sum),
means (7) for comparing said long term predictor gain with a threshold value (thr_lag), and
means for producing a voiced detection decision on basis of said comparison.

9. A mobile station for transmission and reception of speech messages, comprising

means for detecting voice activity in a speech message (x(n)), and

means for making a voice activity decision (V_ind) on basis of the detection, characterized in that it comprises

means (6) for dividing said speech message (x(n)) in subsignals (S(s)) representing specific frequency bands,
means (80) for estimating noise (N(s)) in the subsignals,
means (90) for calculating subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
means (110) for making a voice activity decision (V_ind) for the input signal on basis of the subdecision signals.

10. A method of detecting voice activity in a communication device, the method comprising the steps of:

receiving an input signal (x(n)),
detecting voice activity in the input signal, and
making (110) a voice activity decision (V_ind) on basis of the detection, characterized in that it comprises

dividing (6) said input signal in subsignals (S(s)) representing specific frequency bands,
estimating noise (N(s)) in the subsignals,
calculating (90) subdecision signals (SNR(s)) on basis of the noise in the subsignals, and
making (110) a voice activity decision (V_ind) for the input signal on basis of the subdecision signals.

Fig. 7